174 research outputs found

    Efficient processing of similarity queries with applications

    Today, a myriad of data sources, from the Internet to business operations to scientific instruments, produce large and different types of data. Many application scenarios, e.g., marketing analysis, sensor networks, and medical and biological applications, call for identifying and processing similarities in big data. As a result, it is imperative to develop new similarity query processing approaches and systems that scale from low dimensional data to high dimensional data, from single machine to clusters of hundreds of machines, and from disk-based to memory-based processing. This dissertation introduces and studies several similarity-aware query operators, and analyzes and optimizes their performance. The first contribution of this dissertation is an SQL-based Similarity Group-by operator (SGB, for short) that extends the semantics of the standard SQL Group-by operator to group data with similar but not necessarily equal values. We realize these SGB operators by extending the standard SQL Group-by and introduce two new SGB operators for multi-dimensional data. We implement and test the new SGB operators and their algorithms inside an open-source centralized database server (PostgreSQL). In the second contribution of this dissertation, we study how to efficiently process Hamming-distance-based similarity queries (Hamming-distance select and Hamming-distance join) that are crucial to many applications. We introduce a new index, termed the HA-Index, that speeds up distance comparisons and eliminates redundancies when performing the two flavors of Hamming distance range queries (namely, the selects and joins). In the third and last contribution of this dissertation, we develop a system for similarity query processing and optimization in an in-memory and distributed setup for big spatial data. We propose a query scheduler and a distributed query optimizer that use a new cost model to optimize the cost of similarity query processing in this in-memory distributed setup. The scheduler and query optimizer generate query execution plans that minimize the effect of query skew. The query scheduler employs new spatial indexing techniques based on Bloom filters to forward queries to the appropriate local sites. The proposed query processing and optimization techniques are prototyped inside Spark, a distributed main-memory computation system.
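
To make the two Hamming-distance range queries named in this abstract concrete, here is a minimal brute-force sketch in Python. It is not the HA-Index, which the abstract describes only at a high level; it is the naive baseline that compares every pair. The function names and the choice to represent binary codes as Python integers are assumptions made for illustration.

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two binary codes differ."""
    return bin(a ^ b).count("1")


def hamming_select(codes, query, threshold):
    """Hamming-distance select: all codes within `threshold` bits of `query`."""
    return [c for c in codes if hamming_distance(c, query) <= threshold]


def hamming_join(left, right, threshold):
    """Hamming-distance join: all pairs (a, b) within `threshold` bits.

    Naive nested loop, O(len(left) * len(right)) comparisons.
    """
    return [
        (a, b)
        for a in left
        for b in right
        if hamming_distance(a, b) <= threshold
    ]


if __name__ == "__main__":
    codes = [0b0000, 0b0011, 0b0111, 0b1111]
    print(hamming_select(codes, 0b0001, 1))
    print(hamming_join([0b0000], [0b0001, 0b0110], 1))
```

The nested loop in `hamming_join` repeats the same bitwise comparisons many times, which is the kind of redundancy an index over the codes would aim to avoid.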

    Synthesis and Characterization of pH-Responsive Elastin-Like Polypeptides with Different Configurations

    Elastin-like polypeptides (ELP) are environmentally responsive polymers that exhibit phase transition behavior in response to various stimuli such as temperature, pH and irradiation. In this work, we focused on pH-responsive ELPs. It has been shown that pH value, configurations of ELPs, ELP concentration, and salt concentration can impact the phase transition behavior of pH-responsive ELPs. Quantitative models have been developed to engineer various linear ELPs, but there has not been any detailed study for engineering ionizable ELP with non-linear configurations. In this study, we designed, synthesized, and characterized pH-responsive ELPs with two different configurations. One is linear ELP, (GVGVPGEGVPGVGVP)12, and the other one is a three-armed star ELP, (GVGVPGEGVPGVGVP)12-foldon. Three ELP-foldon chains fold as a trimer in solutions. These constructs were synthesized by molecular biology techniques and characterized by different methods. A modified model based on previous studies was used to describe the pH-dependent reversible phase transition in response to their solution environment. The model was validated with pH-responsive ELPs with different configurations, i.e., linear and trimer. The phase transition of both ELPs fit this model across a range of pH. These results demonstrate that we are able to rationally design responsive polypeptides with different configurations whose transition can be triggered at a specified p

    PRESEE: An MDL/MML Algorithm to Time-Series Stream Segmenting

    Time-series streams are among the most common data types in data mining. They are prevalent in fields such as the stock market, ecology, and medical care. Segmentation is a key step in accelerating time-series stream mining. Previous segmentation algorithms mainly focused on improving precision and paid little attention to efficiency. Moreover, the performance of these algorithms depends heavily on parameters that are hard for users to set. In this paper, we propose PRESEE (parameter-free, real-time, and scalable time-series stream segmenting algorithm), which greatly improves the efficiency of time-series stream segmenting. PRESEE is based on both the MDL (minimum description length) and MML (minimum message length) methods, which allow it to segment the data automatically. To evaluate the performance of PRESEE, we conduct several experiments on time-series streams of different types and compare it with the state-of-the-art algorithm. The empirical results show that PRESEE is very efficient for real-time stream datasets, improving segmenting speed nearly tenfold. The novelty of this algorithm is further demonstrated by applying PRESEE to segment real-time stream datasets from the ChinaFLUX sensor network data stream.
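
The idea of letting a description-length criterion choose segment boundaries without user-set thresholds can be illustrated with a small Python sketch. This is not PRESEE, whose details the abstract does not give. It is a generic two-part-code heuristic that assumes each segment is modeled by its mean with Gaussian residuals. The names `description_length` and `best_split`, the fixed `param_bits` cost, and the `eps` variance floor are all assumptions.

```python
import math


def description_length(values, param_bits=32.0, eps=1e-6):
    """Approximate code length (in bits) of one segment.

    Two-part code: a fixed cost for the segment's parameters plus the
    cost of the residuals under a Gaussian model. Constant terms that
    do not depend on the segmentation are dropped. `eps` keeps the
    logarithm finite for perfectly flat segments.
    """
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    return param_bits + 0.5 * n * math.log2(variance + eps)


def best_split(values, min_len=2):
    """Return the split index that most reduces total description length,
    or None when keeping the window as one segment is cheaper."""
    best_cost = description_length(values)
    best_index = None
    for i in range(min_len, len(values) - min_len + 1):
        cost = description_length(values[:i]) + description_length(values[i:])
        if cost < best_cost:
            best_cost, best_index = cost, i
    return best_index


if __name__ == "__main__":
    step = [0.0] * 20 + [10.0] * 20
    print(best_split(step))        # boundary between the two levels
    print(best_split([1.0] * 10))  # flat window: no split pays off
```

A split is accepted only when the bits saved on residuals outweigh the extra parameter cost of a second segment, so no tolerance threshold needs tuning in this sketch.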

    Adaptive Processing of Spatial-Keyword Data Over a Distributed Streaming Cluster

    The widespread use of GPS-enabled smartphones along with the popularity of micro-blogging and social networking applications, e.g., Twitter and Facebook, has resulted in the generation of huge streams of geo-tagged textual data. Many applications require real-time processing of these streams. For example, location-based e-coupon and ad-targeting systems enable advertisers to register millions of ads to millions of users. The number of users is typically very high and they are continuously moving, and the ads change frequently as well. Hence sending the right ad to the matching users is very challenging. Existing streaming systems are either centralized or are not spatial-keyword aware, and cannot efficiently support the processing of rapidly arriving spatial-keyword data streams. This paper presents Tornado, a distributed spatial-keyword stream processing system. Tornado features routing units to fairly distribute the workload, and furthermore, co-locate the data objects and the corresponding queries at the same processing units. The routing units use the Augmented-Grid, a novel structure that is equipped with an efficient search algorithm for distributing the data objects and queries. Tornado uses evaluators to process the data objects against the queries. The routing units minimize the redundant communication by not sending data updates for processing when these updates do not match any query. By applying dynamically evaluated cost formulae that continuously represent the processing overhead at each evaluator, Tornado is adaptive to changes in the workload. Extensive experimental evaluation using spatio-textual range queries over real Twitter data indicates that Tornado outperforms the non-spatio-textually aware approaches by up to two orders of magnitude in terms of the overall system throughput

    Underwater dual manipulators-Part I: Hydrodynamics analysis and computation

    This paper introduces two 4-DOF underwater manipulators, mounted on an autonomous underwater vehicle (AUV) with grasping claws, so that the AUV can accomplish underwater tasks using dual manipulators. The mechanical design of the manipulator is briefly presented, and the simple structure of the dual manipulators is simulated using SolidWorks. In addition, the hydrodynamics of the manipulator are analyzed, considering drag force, added mass, and buoyancy. Hydrodynamic simulations of the manipulator are then conducted using a 3-D model in Adams software, from which the torque of each joint is calculated. The paper presents an integrated result of computed torques that combines the theoretical calculation and the simulation results, which is instrumental in determining the driving torque of the manipulators.
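
The three hydrodynamic effects listed in this abstract (drag force, added mass, buoyancy) can be sketched for a single manipulator link using standard textbook formulas. The following Python example is not the paper's model. It assumes the link is a rigid circular cylinder moving perpendicular to its axis in fresh water. The class `CylinderLink`, the default coefficients, and the dimensions in the usage example are hypothetical.

```python
import math
from dataclasses import dataclass

RHO_WATER = 1000.0  # kg/m^3, fresh water (assumed)
G = 9.81            # m/s^2


@dataclass
class CylinderLink:
    radius: float                   # m
    length: float                   # m
    drag_coeff: float = 1.2         # assumed value for a cylinder in cross flow
    added_mass_coeff: float = 1.0   # potential-flow value for a circular cylinder


def drag_force(link: CylinderLink, velocity: float) -> float:
    """Quadratic drag (N), opposing the direction of motion."""
    projected_area = 2.0 * link.radius * link.length
    return -0.5 * RHO_WATER * link.drag_coeff * projected_area * velocity * abs(velocity)


def added_mass(link: CylinderLink) -> float:
    """Mass of water effectively accelerated with the link (kg)."""
    return link.added_mass_coeff * RHO_WATER * math.pi * link.radius ** 2 * link.length


def buoyancy(link: CylinderLink) -> float:
    """Upward buoyant force on the fully submerged link (N)."""
    volume = math.pi * link.radius ** 2 * link.length
    return RHO_WATER * G * volume


if __name__ == "__main__":
    link = CylinderLink(radius=0.05, length=0.4)
    print(drag_force(link, 0.5), added_mass(link), buoyancy(link))
```

In a joint-torque calculation, each of these contributions would be multiplied by the appropriate moment arm and summed along the kinematic chain; that step is omitted here because it depends on the manipulator geometry, which the abstract does not specify.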